[SPARK-53157][CORE] Decouple driver and executor polling intervals #51885
base: master
Conversation
LGTM
Scratch that - added a comment below
+CC @dongjoon-hyun for thoughts as well.
@@ -359,6 +359,11 @@ package object config {

```scala
    .timeConf(TimeUnit.MILLISECONDS)
    .createWithDefaultString("10s")

  private[spark] val DRIVER_HEARTBEAT_INTERVAL =
    ConfigBuilder("spark.driver.heartbeatInterval")
```
Given this is mostly just for reporting metrics, I am more inclined to rename this from heartbeat to something which more clearly conveys the intent.
How about `spark.driver.metricsReportingInterval`?
+1 for @mridulm's comment.
Oh I see, we already differentiate: `spark.executor.metrics.pollingInterval`. I will call this `spark.driver.metrics.pollingInterval`?
Thank you for pinging me, @mridulm. I'm catching up on the community reviews this week.
In general, the background of proposal sounds reasonable to me. I'll revisit later when the PR has the final code.
```scala
  private[spark] val DRIVER_METRICS_POLLING_INTERVAL =
    ConfigBuilder("spark.driver.metrics.pollingInterval")
      .version("4.1.0")
      .fallbackConf(EXECUTOR_HEARTBEAT_INTERVAL)
```
Could you add more fields like `spark.executor.metrics.pollingInterval` has, e.g., `.doc` and `.timeConf`? In addition, it would be great if we could place this new configuration somewhere near where the driver configs live instead of here.
spark/core/src/main/scala/org/apache/spark/internal/config/package.scala (lines 369 to 376 in 86dad83):

```scala
  private[spark] val EXECUTOR_METRICS_POLLING_INTERVAL =
    ConfigBuilder("spark.executor.metrics.pollingInterval")
      .doc("How often to collect executor metrics (in milliseconds). " +
        "If 0, the polling is done on executor heartbeats. " +
        "If positive, the polling is done at this interval.")
      .version("3.0.0")
      .timeConf(TimeUnit.MILLISECONDS)
      .createWithDefaultString("0")
```
Added a doc string, but we can't add both `.timeConf` and `.fallbackConf`, and if the user has a non-default `EXECUTOR_HEARTBEAT_INTERVAL`, we should fall back to matching behavior. Thoughts?
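To illustrate the fallback semantics under discussion, here is a hedged, self-contained Scala sketch. This is a simplified stand-in, not Spark's actual `ConfigBuilder`; the class and parser names are invented for illustration. The point it shows: a fallback entry has no default (or converter) of its own, so an unset driver key resolves through the executor entry, which is why a type-fixing call like `.timeConf` and `.fallbackConf` don't combine.

```scala
import java.util.concurrent.TimeUnit

// Simplified stand-in for a config entry with fallback resolution.
final case class ConfigEntry[T](
    key: String,
    default: Option[T],
    parse: String => T,
    fallback: Option[ConfigEntry[T]] = None) {
  // Resolution order: explicit setting, else the fallback entry, else default.
  def read(settings: Map[String, String]): T =
    settings.get(key).map(parse)
      .orElse(fallback.map(_.read(settings)))
      .orElse(default)
      .getOrElse(throw new NoSuchElementException(key))
}

object FallbackSketch {
  // Parse "10s" or a bare millisecond count into milliseconds.
  def parseMillis(s: String): Long =
    if (s.endsWith("s")) TimeUnit.SECONDS.toMillis(s.dropRight(1).toLong)
    else s.toLong

  val executorHeartbeat =
    ConfigEntry[Long]("spark.executor.heartbeatInterval", Some(10000L), parseMillis)

  // Mirrors .fallbackConf(EXECUTOR_HEARTBEAT_INTERVAL): no default of its own;
  // an unset driver key resolves through the executor heartbeat entry.
  val driverPolling =
    ConfigEntry[Long]("spark.driver.metrics.pollingInterval", None, parseMillis,
      fallback = Some(executorHeartbeat))
}
```

With nothing set, `FallbackSketch.driverPolling.read(Map.empty)` resolves to the executor default; setting the driver key overrides the fallback, which matches the "non-default `EXECUTOR_HEARTBEAT_INTERVAL`" behavior described above.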
```scala
  private[spark] val DRIVER_METRICS_POLLING_INTERVAL =
    ConfigBuilder("spark.driver.metrics.pollingInterval")
      .doc("How often to collect driver metrics (in milliseconds). " +
        "If 0, the polling is done at the executor heartbeat interval. " +
```
Is this correct?
Ah, typo - should be `If unset`.
To @mridulm, sorry, but I'm not sure this PR is going in the right direction. I'll leave the rest of the reviews to you, since it seems that LinkedIn requires this feature.
What changes were proposed in this pull request?

Add a config, `spark.driver.metrics.pollingInterval`, and schedule driver polling / heartbeat at that interval.

Why are the changes needed?
Decouple driver and executor heartbeat intervals. Because memory metrics are sampled at the reporting interval, we do not have a fully accurate view of stats at drivers and executors. This is particularly noticeable at the driver, where we don't have the benefit of the larger sample size of metrics contributed by N executors in the application.

Here we provide a way to increase (or change in general) the rate of metric collection at the driver, to help overcome the sampling problem without requiring users to also increase executor heartbeat frequencies.
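A hedged sketch of the decoupling idea (the object name, `start` helper, and `report` callback are illustrative, not the PR's actual code): run driver metric polling on its own schedule rather than piggybacking on the executor heartbeat.

```scala
import java.util.concurrent.{Executors, ScheduledExecutorService, TimeUnit}

object DriverPollerSketch {
  // Run `report` every pollingIntervalMs on a dedicated scheduler thread,
  // independent of any executor heartbeat cadence.
  def start(pollingIntervalMs: Long)(report: () => Unit): ScheduledExecutorService = {
    val scheduler = Executors.newSingleThreadScheduledExecutor()
    scheduler.scheduleAtFixedRate(
      new Runnable { def run(): Unit = report() },
      pollingIntervalMs, pollingIntervalMs, TimeUnit.MILLISECONDS)
    scheduler
  }
}
```

A caller would read the configured `spark.driver.metrics.pollingInterval` and pass it as `pollingIntervalMs`, shutting the scheduler down when the driver stops.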
Does this PR introduce any user-facing change?

Yes, it introduces a new Spark config, `spark.driver.metrics.pollingInterval`.
How was this patch tested?
Verified that metric collection improved when sampling rates were increased, and that the number of events was as expected when the rate was changed.
Methodology for validating that increased driver heartbeat intervals would improve memory collection:
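One way to see why finer driver polling helps: a hedged Scala sketch (the memory signal and numbers are invented for illustration, not the PR's actual validation) comparing the peak memory observed at a coarse versus a fine polling interval.

```scala
object SamplingSketch {
  // Hypothetical driver memory usage in MB at time t (ms): a flat baseline
  // with a short-lived 200 ms spike starting at t = 5000.
  def usageMb(t: Long): Double =
    if (t >= 5000 && t < 5200) 900.0 else 100.0

  // Peak usage observed when sampling every intervalMs over durationMs.
  def observedPeak(intervalMs: Long, durationMs: Long): Double =
    (0L to durationMs by intervalMs).map(usageMb).max

  val coarsePeak = observedPeak(10000, 60000) // 10 s polling misses the spike
  val finePeak   = observedPeak(100, 60000)   // 100 ms polling catches it
}
```

The coarse interval reports a 100 MB peak while the fine interval observes the true 900 MB spike, which is the sampling problem the decoupled driver interval is meant to address.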
Was this patch authored or co-authored using generative AI tooling?
No